One of the most controversial issues in the U.S. educational system is the efficacy of standardized tests, and whether they're unfair to certain groups. Given my prior knowledge of this topic, investigating the correlations between SAT scores and demographics seemed like an interesting angle to take. I could correlate SAT scores with factors like race, gender, income, and more.
The SAT, or Scholastic Aptitude Test, is an exam that U.S. high school students take before applying to college. Colleges take the test scores into account when deciding who to admit, so it's fairly important to perform well on it.
The test consists of three sections, each worth up to 800 points, for a combined score out of 2,400. (The total has changed a few times over the years, but the data set for this project is based on a 2,400-point scale.) Organizations often rank high schools by their average SAT scores, and the scores are also considered a measure of overall school district quality.
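As a quick illustration of the scoring, the combined score is simply the sum of the three sections. The section scores below are invented for the example:

```python
# Hypothetical section scores for one student (each section is out of 800)
section_scores = {"math": 650, "critical_reading": 600, "writing": 580}

# The combined SAT score is the sum of the three sections
combined_score = sum(section_scores.values())
print(combined_score)  # 1830 out of a possible 2400
```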
New York City makes its data on high school SAT scores available online, as well as the demographics for each high school.
The first few rows of the SAT data look like this:

Further exploration revealed that combining these two datasets alone didn't give me all the demographic information I wanted for the full analysis.
The same website hosts several related data sets covering demographic information and test scores. I researched these resources and drew up a list of observations relevant to my analysis.
Here are all of the files in the folder schools:
import pandas as pd

data_files = [
    "ap_2010.csv",
    "class_size.csv",
    "demographics.csv",
    "graduation.csv",
    "hs_directory.csv",
    "sat_results.csv"
]
data = {}
for f in data_files:
    d = pd.read_csv("schools/{0}".format(f))
    key = f.replace(".csv", "")
    data[key] = d
data['sat_results'].head()
data['ap_2010'].head()
data['class_size'].head()
data['demographics'].head()
data['hs_directory'].head()
A part of each of the survey text files looks like this:
dbn       bn      schoolname                    d75  studentssurveyed  highschool  schooltype           rr_s
"01M015"  "M015"  "P.S. 015 Roberto Clemente"   0    "No"              0           "Elementary School"  88
all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding="windows-1252")
d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t", encoding="windows-1252")
survey = pd.concat([all_survey, d75_survey], axis = 0)
survey.head()
Based on the data dictionary, the relevant survey columns are:
survey['DBN'] = survey['dbn']
survey = survey.loc[:,["DBN", "rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p", "saf_p_11", "com_p_11", "eng_p_11", "aca_p_11", "saf_t_11", "com_t_11", "eng_t_11", "aca_t_11", "saf_s_11", "com_s_11", "eng_s_11", "aca_s_11", "saf_tot_11", "com_tot_11", "eng_tot_11", "aca_tot_11"]]
data['survey'] = survey
data['survey'].shape
Exploring the remaining data showed that certain dataframes, like class_size and hs_directory, don't have a DBN column. hs_directory does have a dbn column, which can be renamed to DBN.
From looking at the first few rows in sat_results, we can tell that the DBN in the sat_results data is just a combination of the CSD and SCHOOL CODE columns in the class_size data. The main difference is that the DBN is padded, so that the CSD portion of it always consists of two digits.
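A quick sketch of the padding idea, using made-up CSD and school-code values: Python's `str.zfill` pads a string with leading zeros to a given width, which is exactly what the two-digit CSD portion needs.

```python
# Hypothetical (CSD, school code) pairs for illustration
examples = [(1, "M015"), (13, "K430")]

for csd, school_code in examples:
    # zfill pads with leading zeros so the CSD is always two digits
    dbn = str(csd).zfill(2) + school_code
    print(dbn)
# 01M015
# 13K430
```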
data['hs_directory']['DBN'] = data['hs_directory']['dbn']
def padding_csd(num):
    num = str(num)
    if len(num) == 2:
        return num
    elif len(num) == 1:
        return num.zfill(2)

data['class_size']['padded_csd'] = data['class_size']['CSD'].apply(padding_csd)
data['class_size']['DBN'] = data['class_size']['padded_csd'] + data['class_size']['SCHOOL CODE']
data['class_size']['DBN'].head()
cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
for col in cols:
    data['sat_results'][col] = pd.to_numeric(data['sat_results'][col], errors='coerce')
data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]
data['sat_results']['sat_score'].head()
data['hs_directory']['Location 1'].head()
import re

def find_lat(loc):
    coords = re.findall(r"\(.+\)", loc)
    lat = coords[0].split(',')[0].replace("(", "")
    return lat

data['hs_directory']['lat'] = data['hs_directory']['Location 1'].apply(find_lat)
data['hs_directory'].head()
def find_long(loc):
    coords = re.findall(r"\(.+\)", loc)
    long = coords[0].split(',')[1].replace(")", "")
    return long

data['hs_directory']['long'] = data['hs_directory']['Location 1'].apply(find_long)
data['hs_directory']['lat'] = pd.to_numeric(data['hs_directory']['lat'], errors = 'coerce')
data['hs_directory']['long'] = pd.to_numeric(data['hs_directory']['long'], errors = 'coerce')
data['hs_directory'].head()
data['class_size'].head()
# Fix key error for 'GRADE '
data['class_size']['GRADE'] = data['class_size']['GRADE ']
# Exploring unique values for 'GRADE'
data['class_size']['GRADE'].unique()
Because we're dealing with high schools, we're only concerned with grades 9 through 12.
# Exploring unique values for 'PROGRAM TYPE'
data['class_size']['PROGRAM TYPE'].unique()
Each school can have multiple program types. Because GEN ED is the largest category by far, I only selected rows where PROGRAM TYPE is GEN ED.
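The way to check which category dominates is `value_counts`. Here's a minimal sketch on a toy frame (the column name matches the real one; the values are invented):

```python
import pandas as pd

# Toy stand-in for class_size; the program-type values are made up
toy = pd.DataFrame({"PROGRAM TYPE": ["GEN ED", "GEN ED", "GEN ED", "CTT", "SPEC ED"]})

# value_counts shows which program type is the largest category
print(toy["PROGRAM TYPE"].value_counts())

# Keep only the dominant category, as done for the real data
gen_ed = toy[toy["PROGRAM TYPE"] == "GEN ED"]
print(len(gen_ed))  # 3
```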
class_size = data['class_size']
class_size = class_size[class_size['GRADE'] == '09-12' ]
class_size = class_size[class_size['PROGRAM TYPE'] == 'GEN ED' ]
class_size.head()
DBN still isn't completely unique. This is due to the CORE COURSE (MS CORE and 9-12 ONLY) and CORE SUBJECT (MS CORE and 9-12 ONLY) columns.
# Exploring unique values for 'CORE SUBJECT (MS CORE and 9-12 ONLY)'
data['class_size']['CORE SUBJECT (MS CORE and 9-12 ONLY)'].unique()
import numpy as np
class_size = class_size.groupby('DBN').agg(np.mean)
class_size.reset_index(inplace = True)
data["class_size"] = class_size
data["class_size"].head()
data["demographics"].head(7)
data["demographics"]['schoolyear'].dtype
data['demographics'] = data["demographics"][data["demographics"]['schoolyear'] == 20112012]
data['demographics'].head()
data['graduation'].head()
data['graduation']['Demographic'].unique()
data['graduation']['Cohort'].unique()
data['graduation'] = data['graduation'][data['graduation']['Cohort'] == '2006']
data['graduation'] = data['graduation'][data['graduation']['Demographic'] == 'Total Cohort']
data['graduation'].head()
High school students can take Advanced Placement (AP) exams before applying to college. There are several AP exams, each corresponding to a school subject, and students who earn high scores may receive college credit.
AP exams are scored on a 1-to-5 scale; 3 or higher is a passing score. Many high school students take AP exams, particularly those who attend academically challenging institutions. AP exams are much rarer at schools that lack funding or academic rigor.
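To make the scoring concrete: with a passing threshold of 3, a school's share of passing exams could be computed like this (the scores here are invented):

```python
# Invented AP exam scores for one school
scores = [5, 4, 3, 2, 1, 3, 4, 2]

# A score of 3 or higher counts as passing
passing = [s for s in scores if s >= 3]
pass_rate = len(passing) / len(scores)
print(pass_rate)  # 0.625
```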
It will be interesting to find out whether AP exam scores are correlated with SAT scores across high schools.
data['ap_2010'].head()
list(data['ap_2010'])
data['ap_2010']['AP Test Takers'] = data['ap_2010']['AP Test Takers ']
cols = ['AP Test Takers', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']
for col in cols:
    data['ap_2010'][col] = pd.to_numeric(data['ap_2010'][col], errors="coerce")
data['ap_2010'].head()
combined = data["sat_results"]
combined = combined.merge(data['ap_2010'], how = "left", on = 'DBN')
combined = combined.merge(data['graduation'], how = "left", on = 'DBN')
combined.head()
combined.shape
to_merge = ["class_size", "demographics", "survey", "hs_directory"]
for m in to_merge:
    combined = combined.merge(data[m], on="DBN", how="inner")
combined.head()
combined.shape
combined = combined.fillna(combined.mean())
combined = combined.fillna(0)
combined.head()
def first_two(s):
    return s[0:2]
combined['school_dist'] = combined['DBN'].apply(first_two)
combined['school_dist'].head()
correlations = combined.corr()
correlations = correlations['sat_score']
correlations
import plotly as py
import plotly.graph_objs as go
import numpy as np
py.offline.init_notebook_mode(connected=True)
trace = go.Scatter(
    x=combined['total_enrollment'],
    y=combined['sat_score'],
    mode='markers',
    text=combined['SCHOOL NAME']
)
layout = go.Layout(
    title="total_enrollment vs. sat_score",
    hovermode='closest',
    xaxis=dict(
        title='total_enrollment',
        ticklen=5,
        zeroline=False,
        gridwidth=2,
    ),
    yaxis=dict(
        title='sat_score',
        ticklen=5,
        gridwidth=2,
    )
)
# Use a separate name so we don't shadow the `data` dictionary of dataframes
plot_data = [trace]
fig = go.Figure(data=plot_data, layout=layout)
py.offline.iplot(fig)
low_enrollment = combined[combined['total_enrollment'] < 1000]
low_enrollment = low_enrollment[low_enrollment['sat_score'] < 1000]
low_enrollment['SCHOOL NAME']
Researching the above schools revealed that most of the high schools with low total enrollment and low SAT scores have high percentages of English language learners. This indicates that it's actually ell_percent that correlates strongly with sat_score, rather than total_enrollment.
trace = go.Scatter(
    x=combined['ell_percent'],
    y=combined['sat_score'],
    mode='markers',
    text=combined['SCHOOL NAME']
)
layout = go.Layout(
    title="ell_percent vs. sat_score",
    hovermode='closest',
    xaxis=dict(
        title='ell_percent',
        ticklen=5,
        zeroline=False,
        gridwidth=2,
    ),
    yaxis=dict(
        title='sat_score',
        ticklen=5,
        gridwidth=2,
    )
)
plot_data = [trace]
fig = go.Figure(data=plot_data, layout=layout)
py.offline.iplot(fig)
ell_percent correlates with sat_score more strongly, because the scatterplot is more linear. However, there's still the cluster of schools that have very high ell_percent values and low sat_score values. This cluster represents the same group of international high schools listed earlier.
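One way to back up a "more linear" judgment numerically is to compare the Pearson r values with `Series.corr`. This is a self-contained sketch on invented toy data; the real comparison would run on the `combined` dataframe:

```python
import pandas as pd

# Toy data illustrating the comparison; all numbers are invented
toy = pd.DataFrame({
    "sat_score":        [1100, 1200, 1300, 1400, 1500, 1600],
    "ell_percent":      [40,   30,   25,   15,   10,   5],
    "total_enrollment": [500,  900,  700,  1200, 800,  1500],
})

# Series.corr gives the Pearson r between two columns
r_ell = toy["sat_score"].corr(toy["ell_percent"])
r_enr = toy["sat_score"].corr(toy["total_enrollment"])

# In this toy data, ell_percent has the stronger (negative) relationship
print(abs(r_ell) > abs(r_enr))  # True
```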
import folium
from folium.plugins import MarkerCluster
m = folium.Map(location= [combined['lat'].mean(), combined['long'].mean()], zoom_start = 10, tiles='Stamen Terrain')
marker_cluster = MarkerCluster().add_to(m)
for name, row in combined.iterrows():
    popup = folium.Popup("{0}: {1}".format(row["DBN"], row["SCHOOL NAME"]), parse_html=True)
    folium.Marker([row["lat"], row["long"]], popup=popup).add_to(marker_cluster)
m
m2 = folium.Map(location= [combined['lat'].mean(), combined['long'].mean()], zoom_start = 10, tiles='Stamen Terrain')
m2.add_child(folium.plugins.HeatMap([[row['lat'], row['long']] for name, row in combined.iterrows()]))
m2.save('schools_heatmap.html')
m2
It is evident from the marker cluster and heat map that school density is highest in Manhattan, and lower in Brooklyn, the Bronx, Queens, and Staten Island.
Although heatmaps are good for gradient mapping, they lack the structure needed to accurately plot the differences in SAT scores across the city. Mapping school districts is a better way to visualize this information, as each district has its own administration. New York City has several dozen school districts, and each district is a small geographic area. I mapped the districts using a choropleth map.
import numpy as np
districts = combined.groupby('school_dist').agg(np.mean)
districts.reset_index(inplace = True)
districts.head()
# remove leading 0s
districts["school_dist"] = districts["school_dist"].apply(lambda x: str(int(x)))
import os
def district_map(col):
    dist_geo = os.path.join('schools', 'NYC School District Boundaries.geojson')
    dist = folium.Map(location=[districts['lat'].mean(), districts['long'].mean()], zoom_start=10, tiles='cartodbpositron')
    dist.choropleth(
        geo_data=dist_geo,
        data=districts,
        columns=['school_dist', col],
        key_on='feature.properties.schooldist',
        fill_color='YlOrRd',
        fill_opacity=0.7,
        line_opacity=0.2,
        legend_name=col
    )
    folium.LayerControl().add_to(dist)
    return dist
district_map("sat_score")
In the scatter plot of English language learner percentage (ell_percent) vs. SAT score (sat_score), I observed a group of schools with high ell_percent values that also have low average SAT scores. This analysis can be extended to the district level to observe patterns in SAT scores by district.
district_map('ell_percent')
Districts with a low proportion of English language learners tend to have high SAT scores, and vice versa.
survey_fields = ["rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p", "saf_p_11", "com_p_11", "eng_p_11", "aca_p_11",
                 "saf_t_11", "com_t_11", "eng_t_11", "aca_t_11", "saf_s_11", "com_s_11", "eng_s_11", "aca_s_11",
                 "saf_tot_11", "com_tot_11", "eng_tot_11", "aca_tot_11"]
# DBN is a string identifier, so it doesn't appear in the correlation matrix
trace = go.Bar(
    y=combined.corr()['sat_score'][survey_fields],
    name='Survey Fields',
    text=survey_fields
)
plot_data = [trace]
layout = go.Layout(
    title='Survey Correlations',
    xaxis=dict(
        title='Survey Fields'
    ),
    yaxis=dict(
        title='r-value'
    )
)
fig = go.Figure(data=plot_data, layout=layout)
py.offline.iplot(fig)
There are high correlations between N_s, N_t, N_p and sat_score. Since these columns are correlated with total_enrollment, it makes sense that they would be high.
It is more interesting that rr_s, the student response rate, or the percentage of students that completed the survey, correlates with sat_score. This might make sense because students who are more likely to fill out surveys may be more likely to also be doing well academically.
How students and teachers perceived safety (saf_s_11 and saf_t_11) also correlates with sat_score. This makes sense, as it's hard to teach or learn in an unsafe environment.
The last interesting correlation is aca_s_11, which indicates how students perceive academic standards; it correlates with sat_score, but aca_t_11 (how teachers perceive academic standards) and aca_p_11 (how parents perceive them) do not.
trace = go.Scatter(
    x=combined['saf_s_11'],
    y=combined['sat_score'],
    mode='markers',
    text=combined['SCHOOL NAME']
)
layout = go.Layout(
    title="saf_s_11 vs. sat_score",
    hovermode='closest',
    xaxis=dict(
        title='saf_s_11'
    ),
    yaxis=dict(
        title='sat_score'
    )
)
plot_data = [trace]
fig = go.Figure(data=plot_data, layout=layout)
py.offline.iplot(fig)
There appears to be a correlation between SAT scores and safety, although it isn't that strong. It looks like there are a few schools with extremely high SAT scores and high safety scores. There are a few schools with low safety scores and low SAT scores. No school with a safety score lower than 6.5 has an average SAT score higher than 1500 or so.
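Claims like the last sentence can be checked with a boolean mask. This is a self-contained sketch on invented toy data; the real check would run against the `combined` dataframe:

```python
import pandas as pd

# Toy stand-in for the combined dataframe; all values are invented
toy = pd.DataFrame({
    "saf_s_11":  [5.8, 6.2, 6.9, 7.4, 8.0],
    "sat_score": [1150, 1320, 1480, 1620, 1890],
})

# Schools that would contradict the claim: low safety score but high SAT score
contradictions = toy[(toy["saf_s_11"] < 6.5) & (toy["sat_score"] > 1500)]
print(len(contradictions))  # 0
```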
district_map('saf_s_11')
It looks like Upper Manhattan and parts of Queens and the Bronx tend to have lower safety scores, whereas Brooklyn has high safety scores.
race_fields = ["white_per", "asian_per", "black_per", "hispanic_per"]
trace = go.Bar(
    y=combined.corr()['sat_score'][race_fields],
    name='Race Correlations',
    text=race_fields
)
plot_data = [trace]
layout = go.Layout(
    title='Race Correlations',
    xaxis=dict(
        title='Race Fields'
    ),
    yaxis=dict(
        title='r-value'
    )
)
fig = go.Figure(data=plot_data, layout=layout)
py.offline.iplot(fig)
It looks like a higher percentage of white or Asian students at a school correlates positively with SAT score, whereas a higher percentage of Black or Hispanic students correlates negatively with SAT score. This may be due to a lack of funding for schools in certain areas, which are more likely to have a higher percentage of Black or Hispanic students.
trace = go.Scatter(
    x=combined['hispanic_per'],
    y=combined['sat_score'],
    mode='markers',
    text=combined['SCHOOL NAME']
)
layout = go.Layout(
    title="hispanic_per vs. sat_score",
    hovermode='closest',
    xaxis=dict(
        title='hispanic_per'
    ),
    yaxis=dict(
        title='sat_score'
    )
)
plot_data = [trace]
fig = go.Figure(data=plot_data, layout=layout)
py.offline.iplot(fig)
combined[combined["hispanic_per"] > 95]["SCHOOL NAME"]
The schools listed above appear to be geared primarily toward recent immigrants to the US. These schools have many students who are still learning English, which would explain the lower SAT scores.
combined[(combined["hispanic_per"] < 10) & (combined["sat_score"] > 1800)]["SCHOOL NAME"]
Many of the schools above appear to be specialized science and technology schools that receive extra funding, and only admit students who pass an entrance exam. This doesn't explain the low hispanic_per, but it does explain why their students tend to do better on the SAT -- they are students from all over New York City who did well on a standardized test.
gender_fields = ["male_per", "female_per"]
trace = go.Bar(
    y=combined.corr()['sat_score'][gender_fields],
    name='Gender Correlations',
    text=gender_fields
)
plot_data = [trace]
layout = go.Layout(
    title='Gender Correlations',
    xaxis=dict(
        title='Gender Fields'
    ),
    yaxis=dict(
        title='r-value'
    )
)
fig = go.Figure(data=plot_data, layout=layout)
py.offline.iplot(fig)
In the plot above, we can see that a high percentage of females at a school positively correlates with SAT score, whereas a high percentage of males at a school negatively correlates with SAT score. Neither correlation is extremely strong.
trace = go.Scatter(
    x=combined['female_per'],
    y=combined['sat_score'],
    mode='markers',
    text=combined['SCHOOL NAME']
)
layout = go.Layout(
    title="female_per vs. sat_score",
    hovermode='closest',
    xaxis=dict(
        title='female_per'
    ),
    yaxis=dict(
        title='sat_score'
    )
)
plot_data = [trace]
fig = go.Figure(data=plot_data, layout=layout)
py.offline.iplot(fig)
Based on the scatterplot, there doesn't seem to be any real correlation between sat_score and female_per. However, there is a cluster of schools with a high percentage of females (60% to 80%) and high SAT scores.
combined[(combined["female_per"] > 60) & (combined["sat_score"] > 1700)]["SCHOOL NAME"]
These schools appear to be very selective liberal arts schools with high academic standards.